Abstract. Reading comprehension is key to knowledge acquisition and to reinforcing
memory for previous information. While reading, a mental representation
is constructed in the reader’s mind. The mental model comprises the words in 
the text, the relations between the words, and inferences linking to concepts in 
prior knowledge. The automated model of comprehension (AMoC) simulates the 
construction of readers’ mental representations of text by building syntactic and 
semantic relations between words, coupled with inferences of related concepts 
that rely on various automated semantic models. This paper introduces the second 
version of AMoC that builds upon the initial model with a revised processing 
pipeline in Python leveraging state-of-the-art NLP models, additional heuristics 
for improved representations, as well as a new radiant graph visualization of the 
comprehension model. 
1 Introduction


Comprehension is fundamental to learning. While there is much more to learning (e.g.,
discussion, project building, problem solving), understanding text and discourse
represents a key starting point when attempting to learn or relearn information. How well a
reader understands text or discourse depends on many factors, including individual
differences such as reading skill, prior knowledge of the domain or world, motivation, and
goals. Comprehension also depends on the nature of the text: the difficulties imposed
by the words in the text, the complexity of the syntax, and the flow of the ideas, or
cohesion.

Cohesion between ideas can emerge from overlap between explicit words (e.g., 
nouns, verbs), implied words (anaphor), semantically related words, semantically related 
ideas, and the underlying syntactic structure (i.e., parts of speech, syntactic overlap). When
there is greater overlap, text is easier to understand. Cohesion gaps, by contrast, require 
inferences to make connections between the ideas. If the reader has little knowledge of 
the domain or the world, low-cohesion text impedes comprehension [1]. For example,
if the text is too complex, readers may struggle to understand it or even abandon the
process; conversely, if it is too simple, readers may quickly lose focus or interest.
Thus, designing reading materials suited for learners is an important aspect for educators 
as well as writers when targeting a specific audience. 

The automated model of comprehension (AMoC) simulates the mental representation
constructed by hypothetical readers, by building syntactic and semantic relations
between words, coupled with inferences of related concepts that rely on various semantic
models. AMoC lets the user model aspects of the reader by modifying
parameters related to readers’ knowledge, reading skill, and motivation
(i.e., activation threshold, maximum active number of concepts per sentence, maximum
number of semantically related concepts, and the type of knowledge model). This paper
introduces an updated version of the automated model of comprehension (AMoC version
2.0), which is freely available online at http://readerbench.com/demo/amoc.

AMoC builds on the Construction Integration (CI) model [2], which introduced a
semi-automated cyclical process to simulate reading, as well as the Landscape Model [3], 
which inherited the ideas from the CI model and provided a visual representation of the 
activation scores belonging to the concepts in the text. We describe a revised version of 
AMoC that provides several enhancements: a) an improved processing pipeline rewritten
in Python, b) additional heuristics introduced to better model human constructs, and c) a 
radiant graph visualization to highlight the model’s capabilities. 


2 Method 


The codebase for AMoC version 2 is developed in Python, rather than Java. This decision
was influenced by the progress of the artificial intelligence community and its interest
in libraries written in this programming language, such as TensorFlow [4] and PyTorch
[5], which are frequently used in general neural network projects, and SpaCy [6], which
is an open source tool for NLP tasks. Additionally, the ReaderBench framework [7],
previously implemented in Java and used in the first version of AMoC, migrated to Python,
offering improved functionalities based on state-of-the-art NLP models.

AMoC uses three customizable parameters: minimum activation score (the activation
score required by a word to be active in the mental model), maximum active concepts
(the maximum number of words that can be retrieved in the mental model), and maximum
dictionary expansion (the number of words that can be inferred for each sentence). These
three parameters and the target text are processed by the model. The processing begins by
automatically splitting the text into sentences using ReaderBench. For the current study,
ReaderBench Python uses SpaCy to split the text and store the relations between words. Next,
the syntactic graph for each sentence is computed and stored in the model’s memory.
One difference between AMoC v1 and v2 is that coreferences are obtained and
replaced using NeuralCoref [8] in the newer version, while in the older version a Stanford
CoreNLP [9] module was applied; Wolf [10] argues that NeuralCoref obtains better
overall performance. Additionally, the syntactic parser from SpaCy performs slightly
better than Stanford CoreNLP [11, 12]: SpaCy UAS 92.07, LAS 90.27 versus Stanford
CoreNLP UAS 92.0, LAS 89.7. The process includes the following steps:
1. Each sentence is processed iteratively and each content word (noun or verb) in the
sentence is added to the graph if it was not present before, or its activation score is
incremented by 1 if the reader has previously encountered the concept in the text.

2. When processing a sentence, the top 5 similar concepts are inferred using WordNet 
(to extract synonyms and hypernyms) and word2vec [13]. The word2vec models 
trained on TASA [14], COCA [15], or Google News [13] are considered to reflect 
different levels of reading proficiency in terms of exposure to language.

3. The inferred words from a sentence are filtered based on two criteria: they must have 
a semantic similarity with the sentence of at least .30 (a value argued by Ratner [16]) 
and they must have a Kuperman Age of Acquisition [17] score < 9 (i.e., the word is
accessible to an average reader). 

4. Finally, all of the semantic links in the graph are removed, and the semantic nodes are 
sorted based on the similarity with the current sentence. Then, only the top maximum 
dictionary expansion concepts are added to the graph. 
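
Step 1 can be illustrated with a minimal sketch. This is not the AMoC implementation: the helper name update_activations and the pre-tagged (lemma, part-of-speech) input are assumptions made here, standing in for the output of the SpaCy parse. The initial score of 1 for a newly added word is also an assumption, since the text only specifies the increment for words already encountered.

```python
from collections import defaultdict

# Hypothetical helper (not part of AMoC or ReaderBench): applies step 1 to
# one sentence that was already tagged as (lemma, part-of-speech) pairs.
CONTENT_POS = {"NOUN", "VERB"}


def update_activations(activations, tagged_sentence):
    """Add new content words (assumed initial score 1) or increment known ones by 1."""
    for lemma, pos in tagged_sentence:
        if pos in CONTENT_POS:
            # defaultdict(float) starts unseen words at 0.0, so a first
            # encounter yields 1.0 and every later encounter adds 1.0.
            activations[lemma] += 1.0
    return activations


# Toy input, invented for illustration only.
activations = defaultdict(float)
sentences = [
    [("a", "DET"), ("knight", "NOUN"), ("ride", "VERB")],
    [("the", "DET"), ("knight", "NOUN"), ("fight", "VERB"), ("dragon", "NOUN")],
]
for sentence in sentences:
    update_activations(activations, sentence)
```

In this toy run, "knight" is encountered in both sentences and therefore ends with a higher score than "dragon", while function words such as "the" never enter the graph.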


The key differences between the two versions of AMoC are in the second and third 
steps. The first version of AMoC used only the synonyms extracted from WordNet; the 
current version also uses the hypernyms and words extracted from a word2vec language 
model. Also, the filtering process in the older AMoC version did not include the Age of 
Acquisition score to represent the potential difficulty of the words. 
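
The filtering and selection of inferred words (steps 3 and 4) can be sketched as below, under stated assumptions. The function select_inferred, the use of cosine similarity against a single sentence vector, and the treatment of words without an Age of Acquisition rating as too difficult are all assumptions for illustration; the vectors and ratings are toy values, not data from TASA, COCA, or the Kuperman norms. The use_aoa flag only mimics the difference described above between the older filter (similarity only) and the current one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_inferred(candidates, sentence_vec, vectors, aoa, max_expansion,
                    min_sim=0.30, max_aoa=9.0, use_aoa=True):
    """Keep candidates passing both filters, then return the top max_expansion."""
    scored = []
    for word in candidates:
        sim = cosine(vectors[word], sentence_vec)
        if sim < min_sim:
            continue  # not similar enough to the current sentence
        # Assumption: a word with no AoA rating is treated as too difficult.
        if use_aoa and aoa.get(word, float("inf")) >= max_aoa:
            continue
        scored.append((sim, word))
    scored.sort(reverse=True)  # most similar first
    return [word for _, word in scored[:max_expansion]]


# Toy 2-d vectors and AoA ratings, invented purely for illustration.
vectors = {
    "damsel": (0.9, 0.3),
    "prince": (0.8, 0.6),
    "sovereignty": (0.7, 0.7),
    "carburetor": (-0.2, 1.0),
}
aoa = {"damsel": 8.5, "prince": 4.5, "sovereignty": 12.0, "carburetor": 11.0}
sentence_vec = (1.0, 0.2)

with_aoa = select_inferred(list(vectors), sentence_vec, vectors, aoa, 3)
without_aoa = select_inferred(list(vectors), sentence_vec, vectors, aoa, 3,
                              use_aoa=False)
```

With these toy values, the similarity-only filter retains "sovereignty", whereas the filter with the Age of Acquisition criterion drops it as a word that is related but late-acquired.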

After the semantic nodes are added to the graph, a modified PageRank algorithm 
[18] is run in order to spread activation between concepts and then a normalization step 
is applied. Lastly, after all these operations, nodes become or remain active if they have 
a score above the minimum activation score; otherwise, they are deactivated. 
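
The activation-spreading step can be approximated with a short sketch. The paper does not detail how PageRank is modified or how scores are normalized, so the code below is a plain personalized PageRank over an undirected weighted graph, followed by division by the maximum score; both choices, along with the names spread_activation and active_nodes and the damping value, are assumptions rather than the AMoC implementation.

```python
def spread_activation(edges, activations, damping=0.85, iterations=50):
    """Personalized-PageRank-style spreading; returns scores scaled to a max of 1.

    edges maps (node_a, node_b) to a positive weight; every node that appears
    in edges must also be a key of activations.
    """
    nodes = sorted(activations)
    neighbors = {n: {} for n in nodes}
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight  # the graph is treated as undirected

    total = sum(activations.values())
    prior = {n: activations[n] / total for n in nodes}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor passes on a share of its score proportional
            # to the weight of the connecting edge.
            incoming = sum(
                scores[m] * w / sum(neighbors[m].values())
                for m, w in neighbors[n].items()
            )
            updated[n] = (1 - damping) * prior[n] + damping * incoming
        scores = updated

    top = max(scores.values())
    return {n: s / top for n, s in scores.items()}  # assumed normalization


def active_nodes(scores, min_activation):
    """Nodes whose score is above the minimum activation score."""
    return {n for n, s in scores.items() if s > min_activation}


# Toy graph, invented for illustration; "castle" has no links.
edges = {
    ("knight", "dragon"): 1.0,
    ("knight", "princess"): 1.0,
    ("princess", "marry"): 1.0,
}
activations = {"knight": 3.0, "dragon": 1.0, "princess": 2.0,
               "marry": 1.0, "castle": 1.0}
scores = spread_activation(edges, activations)
active = active_nodes(scores, 0.30)
```

In this toy graph the unconnected concept receives no activation from its neighbors and falls below the threshold, while well-connected concepts remain active.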


3 Results 


A demo of AMoC v2 is available on the ReaderBench website with varying parameters.
Since the release of the first model, the UI has been updated with a highly customizable radiant
graph. Figure 1 illustrates the last sentence from the “Knight” story used to showcase
the Landscape Model (http://www.brainandeducationlab.nl/downloads); the visualization uses
TASA as semantic model, a minimum activation threshold of .30, 20 maximum active
concepts per sentence, and 2 maximum semantically related concepts introduced for
each word in the original text. The inner circle depicts in blue text-based information
that is still active, while the outer circle contains semantically inferred concepts in red
and grayed out inactive concepts. When hovering over a node, the corresponding edges
are colored, and the related concepts are marked in bold. While considering text-based
information, “princess” is related to “dragon”, “armor”, “marry”, and “knight”, whereas
from a semantic perspective, “princess” is linked to “damsel”, “prince”, and “sword”;
all concepts and underlying links are appropriate for the selected story.
Fig. 1. AMoC v2 radiant graph visualization of the last sentence from the Knight story.


4 Conclusions and Future Work 


AMoC provides a fully automated means to model comprehension by leveraging current
techniques in the Natural Language Processing field. The second version of AMoC 
described in this research provides an improved method and optimizations at the code 
base level in comparison to its predecessor, combined with a more rapid execution time. 
Additionally, a new and highly customizable method for concept graph visualization 
was added to the ReaderBench website. 

In future research, we will further test the predictiveness of AMoC by applying it to 
previous studies that examined text comprehension. From a more technical perspective, 
we intend to evaluate potential advantages of using BERT contextualized embeddings 
[19], rather than word2vec. Our overarching objective is to comprehensively account 
for word senses and their contexts within sentences, paragraphs, texts, and language. 